The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises from the fact that translation is a non-monotonic sequence transduction task due to word ordering differences between languages -- this clashes with the monotonic nature of ASR. Therefore, we propose to generate ST tokens out-of-order while remembering how to re-order them later. We achieve this by predicting a sequence of tuples consisting of a source word, the corresponding target words, and post-editing operations dictating the correct insertion points for the target word. We examine two variants of such operation sequences which enable generation of monotonic transcriptions and non-monotonic translations from the same speech input simultaneously. We apply our approach to offline and real-time streaming models, demonstrating that we can provide explainable translations without sacrificing quality or latency. In fact, the delayed re-ordering ability of our approach improves performance during streaming. As an added benefit, our method performs ASR and ST simultaneously, making it faster than using two separate systems to perform these tasks.
translated by 谷歌翻译
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of unpaired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively.
translated by 谷歌翻译
端到端(E2E)模型在口语理解(SLU)系统中变得越来越流行,并开始实现基于管道的方法的竞争性能。但是,最近的工作表明,这些模型努力以相同的意图概括为新的措辞,这表明模型无法理解给定话语的语义内容。在这项工作中,我们在E2E-SLU框架内的未标记文本数据中预先训练了在未标记的文本数据上进行预先训练的语言模型,以构建强大的语义表示。同时结合语义信息和声学信息可以增加推理时间,从而在语音助手等应用程序中部署时会导致高潜伏期。我们开发了一个2频道的SLU系统,该系统使用第一张音频的几秒钟的声学信息进行低潜伏期预测,并通过结合语义和声学表示在第二次通过中进行更高质量的预测。我们从先前的2次端到端语音识别系统上的工作中获得了灵感,该系统同时使用审议网络就可以在音频和第一通道假设上进行。所提出的2个通用SLU系统在Fluent Speech命令挑战集和SLURP数据集上优于基于声学的SLU模型,并减少了延迟,从而改善了用户体验。作为ESPNET-SLU工具包的一部分,我们的代码和模型公开可用。
translated by 谷歌翻译
事实证明,构象异构体在许多语音处理任务中都是有效的。它结合了使用卷积和使用自我注意的全球依赖性提取本地依赖的好处。受此启发,我们提出了一个更灵活,可解释和可自定义的编码器替代方案,分支机构,并在端到端语音处理中对各种远程依赖关系进行建模。在每个编码器层中,一个分支都采用自我注意事项或其变体来捕获远程依赖性,而另一个分支则利用带有卷积门控(CGMLP)的MLP模块来提取局部关系。我们对几种语音识别和口语理解基准进行实验。结果表明,我们的模型优于变压器和CGMLP。它还与构象异构体获得的最先进结果相匹配。此外,由于两分支结构,我们展示了减少计算的各种策略,包括在单个训练有素的模型中具有可变的推理复杂性的能力。合并分支的权重表明如何在不同层中使用本地和全球依赖性,从而使模型设计受益。
translated by 谷歌翻译
最先进的编码器模型(例如,用于机器翻译(MT)或语音识别(ASR))作为原子单元构造并端到端训练。没有其他模型的任何组件都无法(重新)使用。我们描述了Legonn,这是一种使用解码器模块构建编码器架构的过程,可以在各种MT和ASR任务中重复使用,而无需进行任何微调。为了实现可重复性,每个编码器和解码器模块之间的界面都基于模型设计器预先定义的离散词汇,将其接地到边缘分布序列。我们提出了两种摄入这些边缘的方法。一个是可区分的,可以使整个网络的梯度流动,另一个是梯度分离的。为了使MT任务之间的解码器模块的可移植性用于不同的源语言和其他任务(例如ASR),我们引入了一种模态不可思议的编码器,该模态编码器由长度控制机制组成,以动态调整编码器的输出长度,以匹配预期的输入长度范围的范围预训练的解码器。我们提出了几项实验来证明Legonn模型的有效性:可以重复使用德国英语(DE-EN)MT任务的训练有素的语言解码器模块,而没有对Europarl English ASR和ROMANIAN-ENGLISH进行微调(RO)(RO)(RO)(RO) -en)MT任务以匹配或击败相应的基线模型。当针对数千个更新的目标任务进行微调时,我们的Legonn模型将RO-EN MT任务提高了1.5个BLEU点,并为Europarl ASR任务降低了12.5%的相对减少。此外,为了显示其可扩展性,我们从三个模块中构成了一个legonn ASR模型 - 每个模块都在三个不同数据集的不同端到端训练的模型中学习 - 将降低的减少降低到19.5%。
translated by 谷歌翻译
会话双语语言包括三种类型的话语:两个纯粹单色类型和一个内侧型代码切换类型。在这项工作中,我们提出了一个综合框架,共同模拟包括双语语音识别的单声道和代码交换机子任务的可能性。通过定义具有标签到帧同步的单个子任务,我们的联合建模框架可以条件地分解,使得可以仅获得或可能不切换的最终双语输出,仅给出单格式信息。我们表明,该条件分解的联合框架可以由端到端可分解的神经网络进行建模。我们展示了我们拟议模型在单语和代码切换的语料中对双语普通话语音识别的效果。
translated by 谷歌翻译
随着自动语音处理(ASR)系统越来越好,使用ASR输出越来越令于进行下游自然语言处理(NLP)任务。但是,很少的开源工具包可用于在不同口语理解(SLU)基准上生成可重复的结果。因此,需要建立一个开源标准,可以用于具有更快的开始进入SLU研究。我们展示了Espnet-SLU,它旨在在一个框架中快速发展口语语言理解。 Espnet-SLU是一个项目内部到结束语音处理工具包,ESPNET,它是一个广泛使用的开源标准,用于各种语音处理任务,如ASR,文本到语音(TTS)和语音转换(ST)。我们增强了工具包,为各种SLU基准提供实现,使研究人员能够无缝混合和匹配不同的ASR和NLU模型。我们还提供预磨损的模型,具有集中调谐的超参数,可以匹配或甚至优于最新的最先进的性能。该工具包在https://github.com/espnet/espnet上公开提供。
translated by 谷歌翻译
Conventional conversation assistants extract text transcripts from the speech signal using automatic speech recognition (ASR) and then predict intent from the transcriptions. Using end-to-end spoken language understanding (SLU), the intents of the speaker are predicted directly from the speech signal without requiring intermediate text transcripts. As a result, the model can optimize directly for intent classification and avoid cascading errors from ASR. The end-to-end SLU system also helps in reducing the latency of the intent prediction model. Although many datasets are available publicly for text-to-intent tasks, the availability of labeled speech-to-intent datasets is limited, and there are no datasets available in the Indian accent. In this paper, we release the Skit-S2I dataset, the first publicly available Indian-accented SLU dataset in the banking domain in a conversational tonality. We experiment with multiple baselines, compare different pretrained speech encoder's representations, and find that SSL pretrained representations perform slightly better than ASR pretrained representations lacking prosodic features for speech-to-intent classification. The dataset and baseline code is available at \url{https://github.com/skit-ai/speech-to-intent-dataset}
translated by 谷歌翻译
The rapid development of remote sensing technologies have gained significant attention due to their ability to accurately localize, classify, and segment objects from aerial images. These technologies are commonly used in unmanned aerial vehicles (UAVs) equipped with high-resolution cameras or sensors to capture data over large areas. This data is useful for various applications, such as monitoring and inspecting cities, towns, and terrains. In this paper, we presented a method for classifying and segmenting city road traffic dashed lines from aerial images using deep learning models such as U-Net and SegNet. The annotated data is used to train these models, which are then used to classify and segment the aerial image into two classes: dashed lines and non-dashed lines. However, the deep learning model may not be able to identify all dashed lines due to poor painting or occlusion by trees or shadows. To address this issue, we proposed a method to add missed lines to the segmentation output. We also extracted the x and y coordinates of each dashed line from the segmentation output, which can be used by city planners to construct a CAD file for digital visualization of the roads.
translated by 谷歌翻译
As information extraction (IE) systems have grown more capable at whole-document extraction, the classic task of \emph{template filling} has seen renewed interest as a benchmark for evaluating them. In this position paper, we call into question the suitability of template filling for this purpose. We argue that the task demands definitive answers to thorny questions of \emph{event individuation} -- the problem of distinguishing distinct events -- about which even human experts disagree. We show through annotation studies and error analysis that this raises concerns about the usefulness of template filling evaluation metrics, the quality of datasets for the task, and the ability of models to learn it. Finally, we consider possible solutions.
translated by 谷歌翻译